Attentive Variational Information Bottleneck for TCR–peptide interaction prediction

Abstract
Motivation: We present a multi-sequence generalization of Variational Information Bottleneck and call the resulting model Attentive Variational Information Bottleneck (AVIB). Our AVIB model leverages multi-head self-attention to implicitly approximate a posterior distribution over latent encodings conditioned on multiple input sequences. We apply AVIB to a fundamental immuno-oncology problem: predicting the interactions between T-cell receptors (TCRs) and peptides.
Results: Experimental results on various datasets show that AVIB significantly outperforms state-of-the-art methods for TCR–peptide interaction prediction. Additionally, we show that the latent posterior distribution learned by AVIB is particularly effective for the unsupervised detection of out-of-distribution amino acid sequences.
Availability and implementation: The code and the data used for this study are publicly available at: https://github.com/nec-research/vibtcr.
Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
Predicting whether T cells recognize peptides presented on cells is a fundamental step towards the development of personalized treatments to enhance the immune response, like therapeutic cancer vaccines (Buhrman and Slansky, 2013; Corse et al., 2011; Hundal et al., 2020; McMahan et al., 2006; Meng and Butterfield, 2002; Slansky et al., 2000). In the human immune system, T cells monitor the health status of cells by identifying foreign peptides on their surface (Davis and Bjorkman, 1988; Krogsgaard and Davis, 2005). The T-cell receptors (TCRs) are able to bind to these peptides, especially if they originate from an infected or cancerous cell. The binding of TCRs to peptides presented by major histocompatibility complex (MHC) molecules in peptide-MHC (pMHC) complexes, also known as TCR recognition, constitutes a necessary step for immune response (Glanville et al., 2017; Rowen et al., 1996). Only if TCR recognition takes place can cytokines be released, which leads to the death of a target cell.
TCRs consist of an α- and a β-chain whose structures determine the interaction with the pMHC complex. Each chain consists of three loops, referred to as complementarity-determining regions (CDR1-3). It is believed that the CDR3 loops primarily interact with the peptide of a given pMHC complex (Feng et al., 2007; La Gruta et al., 2018; Rossjohn et al., 2015). Supplementary Material S1 depicts the 3D structure of a TCR-pMHC complex.
Recent discoveries (Dash et al., 2017; Lanzarotti et al., 2019) have demonstrated that both the CDR3α- and β-chains carry information on the specificity of the TCR toward its cognate pMHC target. Obtaining information about paired TCR α- and β-chains requires specific and expensive experiments, like single-cell (SC) sequencing, which limits its availability. Conversely, the bulk sequencing of a population of cells reactive to a peptide is cheaper, but it only provides information about either the α- or the β-chain.
In this work, we propose Attentive Variational Information Bottleneck (AVIB) to predict TCR-peptide interactions. AVIB is a multi-sequence generalization of Variational Information Bottleneck (VIB) (Alemi et al., 2016). Notably, we introduce Attention of Experts (AoE), a novel method for combining single-sequence latent distributions into a joint multi-sequence latent encoding distribution.

Background and related works

TCR-pMHC and TCR-peptide interaction prediction
Several recent works have investigated TCR-pMHC and TCR-peptide interaction prediction. Various proposed approaches perform simple CDR3β sequence alignment (Chronister et al., 2021; Wong et al., 2019). TCRdist computes CDR similarity-weighted distances (Dash et al., 2017). SETE adopts k-mer feature spaces in combination with principal component analysis and decision trees (Tong et al., 2020). Various methods adopt Random Forest for classification (De Neuter et al., 2018; Gielis et al., 2019; Springer et al., 2020). ImRex tackles the problem with a method based on convolutional neural networks (CNNs) (Moris et al., 2021). TCRGP is a classification method which leverages a Gaussian process (Jokinen et al., 2019). ERGO is a deep learning approach which adopts long short-term memory networks and autoencoders to compute representations of peptide and CDR3β (Springer et al., 2020). ERGO II (Springer et al., 2021) is an updated version of ERGO which considers additional input data, i.e. CDR3α sequence, V and J genes, MHC and T-cell type. NetTCR-1.0 (Jurtz et al., 2018) and NetTCR-2.0 (Montemurro et al., 2021) propose a simple 1D CNN-based model, integrating peptide and CDR3 sequence information for the prediction of TCR-peptide specificity. TITAN (Weber et al., 2021) is a bimodal neural network that explicitly encodes the β-chain and the peptide; it leverages transfer learning and SMILES (Weininger et al., 1989) encoding to achieve good generalization.

Deep multimodal variational inference
The problem investigated in this work consists in predicting whether multiple sequences of amino acids, i.e. a peptide and the CDR3s, bind. A single sequence is not informative of whether binding takes place or not when observed alone. As a consequence, binding prediction cannot be framed as a classical multimodal learning problem. Nevertheless, this work presents a strong relationship with multimodal variational inference and takes inspiration from it. In this section, related works from both the supervised and self-supervised learning domains are presented.
Self-supervised learning. Deep neural networks have proved successful at modeling probability distributions in the context of Variational Bayes (VB) methods. The Variational Autoencoder (VAE) (Kingma and Welling, 2013) jointly trains a generative model from latent variables to observations with an inference network from observations to latent variables. Multimodal generalizations of the VAE must tackle the problem of learning a joint posterior distribution of the latent variable conditioned on multiple input modalities. The Multimodal Variational Autoencoder (MVAE) (Wu and Goodman, 2018) models the joint posterior as a Product of Experts (PoE) over the marginal posteriors, enabling cross-modal generation at test time. The Mixture-of-experts Multimodal Variational Autoencoder (MMVAE) (Shi et al., 2019) factorizes the joint variational posterior as a combination of unimodal posteriors, using a Mixture of Experts (MoE). MoE-based models have been used in the biomedical field to tackle challenges such as protein-protein interactions (Qi et al., 2007), biomolecular sequence annotation (Caragea et al., 2009) and clustering cell phenotypes from SC data (Kopf et al., 2021). Their main advantage is that they can infer global patterns in the genetic or peptide sequences in supervised and unsupervised settings (Kopf et al., 2021).
Supervised learning. The VIB (Alemi et al., 2016) is to supervised learning what the β-VAE (Higgins et al., 2017) is to unsupervised learning. VIB leverages variational inference to construct a lower bound on the Information Bottleneck (IB) objective (Tishby et al., 2000). By applying the reparameterization trick (Kingma and Welling, 2013), Monte Carlo sampling is used to get an unbiased estimate of the gradient of the VIB objective. This allows using stochastic gradient descent to optimize the objective. Various multimodal generalizations of the VIB have been recently proposed: the Multimodal Variational Information Bottleneck (MVIB) (Grazioli et al., 2022a) and DeepIMV (Lee and Schaar, 2021). Both MVIB and DeepIMV adopt the PoE to estimate a joint multimodal latent encoding distribution from the unimodal latent encoding distributions. In contrast, our AVIB model predicts interactions among multiple input sequences. This requires modeling complex relations among different sequences (analogous to, but not the same as, modalities), for which PoE is a sub-optimal choice; AVIB instead relies on powerful and flexible multi-head self-attention.

Materials and methods
Let $Y$ be a random variable representing a ground truth label associated with an input random variable $X$. Let $Z$ be a stochastic encoding of $X$ defined by an encoder $p_\theta(Z|x)$ parameterized by a neural network. (Notation: $X, Y, Z$ are random variables; $x, y, z$ are their realizations; $f(\cdot; \theta)$ and $p_\theta(\cdot)$ are functions and probability distributions parameterized by a vector $\theta$; $\mathcal{S}$ represents a set.) Following Tishby et al. (2000), our goal is to learn an encoding $Z$ which is (a) maximally informative about $Y$ and (b) maximally compressive about $X$. Using an information-theoretic approach, we obtain the IB objective with $p(X, Y, Z) = p(Z|X)\,p(Y|X)\,p(X)$:

$$\max_{\theta}\; I(Z, Y; \theta) - \beta\, I(Z, X; \theta), \qquad (1)$$

where $\beta \geq 0$ is a Lagrange multiplier controlling the trade-off between (a) and (b) and $I(Z, Y; \theta)$ is the mutual information between $Z$ and $Y$ parameterized by $\theta$:

$$I(Z, Y; \theta) = \int p(z, y|\theta) \log \frac{p(z, y|\theta)}{p(z|\theta)\,p(y|\theta)}\, dy\, dz. \qquad (2)$$

As derived in Alemi et al. (2016), assuming $q_\phi(y|z)$ and $r_\omega(z)$ are variational approximations of the true $p(y|z)$ and $p(z)$, respectively, Equation 1 can be rewritten as:

$$J_{\mathrm{VIB}} = \frac{1}{N} \sum_{n=1}^{N} \Big\{ \mathbb{E}_{\epsilon \sim p(\epsilon)} \big[ -\log q_\phi\big(y_n \,|\, f(x_n, \epsilon; \theta)\big) \big] + \beta\, D_{\mathrm{KL}}\big( p_\theta(Z|x_n) \,\|\, r_\omega(Z) \big) \Big\}, \qquad (3)$$

where $\epsilon \sim \mathcal{N}(0, I)$ is an auxiliary Gaussian noise variable, $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence and $f(\cdot; \theta)$ is a vector-valued parametric deterministic encoding function (here a neural network). The reparameterization trick (Kingma and Welling, 2013) introduces $\epsilon$ and allows writing $p_\theta(z|x)\,dz = p(\epsilon)\,d\epsilon$, where $z = f(x, \epsilon; \theta)$ is now treated as a deterministic variable. Because the noise variable is independent of the model parameters, the gradients of the objective in Equation 3 can be computed and the objective optimized via backpropagation. In this work, we let the latent encoding distribution on $Z$ be a multivariate Gaussian distribution with a diagonal covariance structure, $z \sim p_\theta(Z|x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$; a valid reparameterization is $z = \mu + \sigma \odot \epsilon$.
With the variational distribution $r_\omega(Z)$ set to a standard multivariate Gaussian distribution $\mathcal{N}(0, I)$ as done in practice, we can view VIB as a variational encoder-decoder analogous to VAE (Kingma and Welling, 2013), in which the latent encoding distribution $p_\theta$ can be viewed as a latent posterior, and the variational decoding distribution $q_\phi$ can be viewed as a decoder.
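As an illustration of the reparameterization and of the two terms of the VIB objective, the following is a minimal sketch and not the authors' implementation. It assumes a standard Gaussian for the variational marginal, for which the KL term has a well-known closed form, a Bernoulli decoder for a binary label, and a toy linear decoder; all function names and the `decoder_weights` parameter are hypothetical.

```python
import math
import random

def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps element-wise, with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(s * s + m * m - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))

def binary_nll(y, logit):
    """-log q(y | z) for a Bernoulli decoder, in a numerically stable form."""
    # Equals softplus(logit) - y * logit.
    return max(logit, 0.0) - y * logit + math.log1p(math.exp(-abs(logit)))

def vib_loss(y, mu, sigma, decoder_weights, beta, rng):
    """Single-sample Monte Carlo estimate of the per-example VIB loss."""
    z = reparameterize(mu, sigma, rng)
    # Toy linear decoder (an assumption made only for this sketch).
    logit = sum(w * zi for w, zi in zip(decoder_weights, z))
    return binary_nll(y, logit) + beta * kl_to_standard_normal(mu, sigma)
```

Because the noise is drawn independently of `mu` and `sigma`, an automatic differentiation framework can propagate gradients through `z` to the encoder parameters.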
In the same spirit of extending VAE (Kingma and Welling, 2013) to MVAE (Wu and Goodman, 2018), the VIB objective of Equation 3 can be generalized by representing $X$ as a collection of multiple input sequences $X = \{X_i \,|\, i\text{-th sequence is present}\}$. In light of this, in the language of a variational encoder-decoder, the posterior $p_\theta(Z|x)$ of Equation 3 is actually the joint posterior $p_\Theta(Z|x_1, \dots, x_M) := p_\Theta(Z|x_{1:M})$, conditioned jointly on the $M$ available sequences. However, for predicting the interaction label $Y$ from $X$, the $M$ different sequences cannot be simply treated as $M$ different modalities.

Attention of experts
Similar to previous works (Grazioli et al., 2022a; Wu and Goodman, 2018), the single-sequence posteriors are modeled as Gaussian distributions with diagonal structure: $\tilde{q}_{\theta_i}(Z|x_i) = \mathcal{N}(\mu_i, \mathrm{diag}(\sigma_i^2))$. By stacking the parameters (represented as column vectors) $\mu_0$ and $\sigma_0$ of the latent prior with the $\mu_i$ and $\sigma_i$ for all available $i = 1, \dots, M$ sequences, we define the following two matrices $\mathbf{M} \in \mathbb{R}^{(M+1) \times d_Z}$ and $\mathbf{\Sigma} \in \mathbb{R}^{(M+1) \times d_Z}$, where $d_Z$ is the dimensionality of the latent single-sequence posteriors:

$$\mathbf{M} = [\mu_0, \mu_1, \dots, \mu_M]^{T}, \qquad \mathbf{\Sigma} = [\sigma_0, \sigma_1, \dots, \sigma_M]^{T}. \qquad (4)$$

We propose to implicitly learn the dependencies between the $M$ single-sequence posteriors and the multi-sequence joint posterior by means of multi-head self-attention, leveraging its power of capturing multiple complex interactions in $X$, and allowing for possibly missing sequences (Equation 5). MultiHead is the standard multi-head attention block defined in Vaswani et al. (2017), whose equations are provided in Supplementary Material S2. We refer to Equation 5 as AoE. Figure 1 provides a schematic depiction of AoE. We refer to a multi-sequence VIB which adopts AoE for modeling the multi-sequence joint posterior as AVIB. The AVIB objective is:

$$J_{\mathrm{AVIB}} = \frac{1}{N} \sum_{n=1}^{N} \Big\{ \mathbb{E}_{\epsilon \sim p(\epsilon)} \big[ -\log q_\phi\big(y_n \,|\, f(x^{n}_{1:M}, \epsilon; \Theta)\big) \big] + \beta\, D_{\mathrm{KL}}\big( p_\Theta(Z|x^{n}_{1:M}) \,\|\, r_\omega(Z) \big) \Big\}, \qquad (6)$$

where the multi-sequence posterior is modeled as $p_\Theta(Z|x_{1:M}) = \mathcal{N}(\mu_{\mathrm{AoE}}, \mathrm{diag}(\sigma^2_{\mathrm{AoE}}))$. Due to space limitations, we provide a detailed description of the implementation, the training setup and the choice of the hyperparameter $\beta$ in Supplementary Material S3. Supplementary Material S4 describes the full AVIB architecture.
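The idea of attending over stacked posterior parameters can be sketched as follows. This is a hedged illustration, not the authors' code: the exact form of Equation 5 is not reproduced here, so the sketch assumes that the attended rows are reduced by max pooling over the sequence axis and that a softplus keeps the standard deviations positive. The weights are random, and all names (`attention_of_experts`, `init_multi_head`, etc.) are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over the rows of x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)  # scores[i][j] = q_i . k_j
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head(x, heads, w_o):
    """Concatenate the heads row by row, then apply the output projection."""
    outputs = [attention_head(x, *head) for head in heads]
    concat = [[v for out in outputs for v in out[i]] for i in range(len(x))]
    return matmul(concat, w_o)

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def init_multi_head(d_z, n_heads, rng):
    """Random (untrained) parameters; d_z must be divisible by n_heads."""
    d_head = d_z // n_heads
    heads = [tuple(random_matrix(d_z, d_head, rng) for _ in range(3))
             for _ in range(n_heads)]
    return heads, random_matrix(n_heads * d_head, d_z, rng)

def attention_of_experts(mus, sigmas, params_mu, params_sigma):
    """mus, sigmas: (number of available experts) x d_z lists of rows."""
    attended_mu = multi_head(mus, *params_mu)
    attended_sigma = multi_head(sigmas, *params_sigma)
    # Assumption: max pooling over rows; softplus keeps sigma positive.
    mu_aoe = [max(col) for col in zip(*attended_mu)]
    sigma_aoe = [math.log1p(math.exp(max(col))) for col in zip(*attended_sigma)]
    return mu_aoe, sigma_aoe
```

Since self-attention and pooling operate on however many rows are supplied, the same parameters can be applied when a sequence is missing by omitting its row.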

Relation to multimodal variational inference
MVAE (Wu and Goodman, 2018) and MVIB (Grazioli et al., 2022a) approximate the joint posterior $p_\Theta(Z|x_{1:M})$ assuming that the $M$ modalities are conditionally independent, given the common latent variable $Z$. This allows expressing the joint posterior as a product of unimodal approximate posteriors $\tilde{q}_{\theta_i}(Z|x_i)$ and a prior $p(Z)$, referred to as PoE:

$$p_\Theta(Z|x_{1:M}) \propto p(Z) \prod_{i=1}^{M} \tilde{q}_{\theta_i}(Z|x_i).$$

MMVAE (Shi et al., 2019) factorizes the joint multimodal posterior as a mixture of Gaussian unimodal posteriors, referred to as MoE:

$$p_\Theta(Z|x_{1:M}) = \sum_{i=1}^{M} \alpha_i\, \tilde{q}_{\theta_i}(Z|x_i), \qquad \alpha_i = \frac{1}{M}.$$

PoE assumes conditional independence between modalities (Hinton, 2002). Furthermore, conditional dependence is impossible to capture by MoE, due to its additive form. This becomes a major shortcoming when modeling TCR-peptide interaction, in which the single sequences are not predictive of the binding if observed individually. Although AoE does not explicitly parameterize conditional dependence between the sequences, it does not assume that each sequence should be individually predictive of the class label, making it a more suitable candidate to model molecular interactions.
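For diagonal Gaussian experts, both combinations have simple forms, sketched below under stated assumptions: the product of Gaussians is again Gaussian with summed precisions, and a sample from a uniform mixture is obtained by first picking one expert at random. The function names are hypothetical, and the prior is passed as one more expert row.

```python
import random

def product_of_experts(mus, sigmas):
    """Combine diagonal Gaussian experts by multiplying their densities.

    mus, sigmas: one row per expert (include the prior as a row if desired).
    Returns the mean and standard deviation of the resulting Gaussian.
    """
    dims = range(len(mus[0]))
    # Precisions (1 / sigma^2) add up under a product of Gaussians.
    precision = [sum(1.0 / s[j] ** 2 for s in sigmas) for j in dims]
    mean = [sum(m[j] / s[j] ** 2 for m, s in zip(mus, sigmas)) / precision[j]
            for j in dims]
    sigma = [p ** -0.5 for p in precision]
    return mean, sigma

def sample_mixture_of_experts(mus, sigmas, rng):
    """Sample from a uniform mixture: pick one expert, then sample from it."""
    i = rng.randrange(len(mus))
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mus[i], sigmas[i])]
```

The product sharpens the distribution whenever experts agree (an 'AND'-like behavior), whereas each mixture sample reflects a single expert (an 'OR'-like behavior).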
AoE can improve on PoE and MoE on multiple levels. First, employing attention for estimating the joint multi-sequence posterior allows learning relative importance among the various single-sequence posteriors. This allows dynamically enhancing the weight given to certain input sequences, while diminishing the focus on others, without being restricted to 'AND' and 'OR' relations, like PoE and MoE, respectively (Shi et al., 2019). Second, as AoE is a parametric trainable module, it can learn to accommodate miscalibrated single-sequence posteriors, which are especially difficult to handle by PoE (Kutuzova et al., 2021).
The adoption of PoE and MoE for the approximation of a multimodal posterior using unimodal encoders also allows inference when certain modalities are missing (Grazioli et al., 2022a; Kutuzova et al., 2021; Shi et al., 2019; Wu and Goodman, 2018). A single encoder applied to the concatenation of all modalities would not allow that. Just like PoE and MoE, AoE allows inference with missing inputs. There is in fact no restriction on the number of rows of $\mathbf{M}$ and $\mathbf{\Sigma}$ (see Equations 4 and 5), which is the equivalent of the number of word tokens in a natural language processing setting (see Section 3.5).
In this work, we only benchmark AoE against PoE and do not compare against MoE. We believe MoE's 'OR'-nature is not suitable for modeling the chemical specificity of multiple molecules. If taken alone, the single-sequence variational posteriors are not informative of the chemical reaction. Analogously, sampling from a MoE, which has similarities to the OR operator, is not suitable for capturing how molecules chemically interact.

Information Bottleneck Mahalanobis distance
Although AVIB is not explicitly designed for uncertainty estimation, we propose a simple, yet effective, approach for out-of-distribution (OOD) detection. This approach is strongly inspired by Lee et al. (2018) and leverages the Mahalanobis distance. In the following, we first summarize the method proposed by Lee et al. (2018). Then, we describe how to extend this approach to AVIB.
Mahalanobis distance. The Mahalanobis distance has proved to be an effective metric for OOD detection (Lee et al., 2018). Let $f_l(x)$ denote the output of the $l$-th hidden layer of a neural network, given an input $x$. Using the training samples, this method fits a class-conditional Gaussian distribution to the embeddings of each class, computing a per-class mean $\mu_l^c = \frac{1}{N_c} \sum_{i: y_i = c} f_l(x_i)$ and a shared covariance matrix $\Sigma_l = \frac{1}{N} \sum_{c=1}^{K} \sum_{i: y_i = c} (f_l(x_i) - \mu_l^c)(f_l(x_i) - \mu_l^c)^{T}$. Given a test sample $x$, the Mahalanobis score is computed as $\mathrm{score}_{\mathrm{Maha}}(x) = \sum_l \alpha_l M_l(x)$, where $M_l(x) = -\min_c \left( \frac{1}{2} (f_l(x) - \mu_l^c)^{T} \Sigma_l^{-1} (f_l(x) - \mu_l^c) \right)$. Lee et al. (2018) fit the $\alpha_l$ coefficients by training a logistic regression on a set of samples for which the knowledge of the OOD/ID label is assumed. Additionally, the authors show that adding a small controlled noise of magnitude $\epsilon$ to the input can improve results, analogously to ODIN (Liang et al., 2017).
We leverage the expectation of the learned latent posterior conditioned on all input sequences and fit $K$ class-conditional Gaussian distributions using the ID training samples, where $K$ is the number of classes. For the TCR-peptide interaction prediction task, we have two classes: binders and non-binders. The $K$ empirical class means are computed as in Equation 7, and a shared covariance matrix is computed as in Equation 8. Lee et al. (2018) compute Mahalanobis distances at multiple hidden layers of a neural network. A logistic regression is trained to learn relative weights assuming the knowledge of the ID/OOD label for a set of validation samples. In contrast to that, given a multi-sequence sample $x_{1:M}$, we propose to leverage the multi-sequence posterior distribution over encodings learned by AVIB, i.e. $p_\Theta(Z|x_{1:M})$, and compute one single Mahalanobis distance, which acts as the OOD score (Equation 9). This approach is hyperparameter free. Hence, it does not require a validation set for tuning. As a consequence, prior knowledge of OOD validation samples is not required.
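The procedure described above can be sketched as follows, as an illustration under stated assumptions rather than the authors' implementation. Each embedding stands for the posterior mean of one training sample; the score is taken to be the smallest squared Mahalanobis distance to any class mean, so that larger values indicate more OOD-like samples (the sign convention is an assumption). The shared covariance is assumed to be invertible, and all function names are hypothetical.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_class_gaussians(embeddings, labels):
    """Return per-class means and one covariance matrix shared by all classes."""
    d = len(embeddings[0])
    means = {}
    for c in sorted(set(labels)):
        members = [e for e, y in zip(embeddings, labels) if y == c]
        means[c] = [sum(col) / len(members) for col in zip(*members)]
    cov = [[0.0] * d for _ in range(d)]
    for e, y in zip(embeddings, labels):
        diff = [a - b for a, b in zip(e, means[y])]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j] / len(embeddings)
    return means, cov

def mahalanobis_ood_score(z, means, cov):
    """Smallest squared Mahalanobis distance from z to any class mean."""
    best = float("inf")
    for mu in means.values():
        diff = [a - b for a, b in zip(z, mu)]
        # diff^T cov^{-1} diff, computed without forming the inverse.
        dist = sum(d * s for d, s in zip(diff, solve(cov, diff)))
        best = min(best, dist)
    return best
```

No threshold or weighting coefficient is fitted in this sketch, which mirrors the point that a single distance needs no OOD validation samples.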

Results and discussion
First, we provide a description of the datasets used in this work. We then apply AVIB to the TCR-peptide interaction prediction problem. Last, we demonstrate AVIB's effectiveness in the context of OOD detection. All experiments are implemented using PyTorch (Paszke et al., 2019). Code and data are publicly available at: https://github.com/nec-research/vibtcr.
α + β set. 117 753 samples out of 271 366 present peptide information, along with both CDR3α and CDR3β sequences. In this work, we refer to this subset as the α + β set. The ground truth label is a binary variable which represents whether the peptide and TCR chains interact.
β set. 153 613 samples out of 271 366 present peptide and CDR3β information (the CDR3α sequence is missing). We refer to this subset as the β set. The β set and the α + β set are disjoint.
Human TCR set. We refer to the totality of the human TCR-peptide data (i.e. β set ∪ α + β set) as Human TCR set.
Non-human TCR set. We extract 5036 non-human TCR samples from the VDJdb database, which we use as OOD samples. These samples come from mice and macaques and present peptide and CDR3β information. We refer to these samples as Non-human TCR set.
In addition to the TCR datasets, in order to thoroughly evaluate AVIB on multiple types of molecular data, we perform experiments on the following peptide-MHC datasets.
NetMHCIIpan-4.0 set. This dataset consists of 108 959 peptide-MHC pairs and was proposed in Reynisson et al. (2020) for the training of the NetMHCIIpan-4.0 model. All MHC molecules are class II. A continuous binding affinity (BA) value, ranging in [0, 1], is associated with each (peptide, MHC) pair and used to validate AVIB on a regression task.
Human MHC set. We create a second set of OOD samples composed of 463 684 peptide-MHC pairs. The peptide sequences are taken from the Human TCR set, i.e. the peptide information is shared among ID and OOD sets. The MHC molecules are represented as pseudo-sequences of amino acids. [For the MHC pseudo-sequences, we refer to the PUFFIN (Zeng and Gifford, 2019) repository: https://github.com/gifford-lab/PUFFIN/blob/master/data/pseudosequence.2016.all.X.dat.] We consider both Class I and Class II MHC alleles. We refer to these samples as Human MHC set.

Pre-processing
In this work, peptides, CDR3α and CDR3β are represented as sequences of amino acids. The 20 amino acids translated by the genetic code are conventionally represented by letters of the English alphabet. Analogously to Montemurro et al. (2021), we pre-process the amino acid sequences using BLOSUM50 encodings (Henikoff and Henikoff, 1992), i.e. the substitution value of each amino acid represented by the BLOSUM50 matrix's diagonal. This allows us to represent a sequence of $N$ amino acids as a $20 \times N$ matrix, analogously to the approach proposed by Nielsen et al. (2003). After performing BLOSUM50 encoding, we standardize the features by subtracting the mean and scaling to unit variance. As the length of the amino acid sequences is not constant, we apply zero-padding after the BLOSUM50 encoding (Mösch and Frishman, 2021). This ensures that all matrices have shape $20 \times N_{\max}$, where $N_{\max}$ is the length of the longest sequence. Information on the length distribution of the amino acid sequences can be found in Supplementary Material S5.1.
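The encoding and padding steps can be sketched as below. This is only an illustration: the table used here is a placeholder and does not contain the BLOSUM50 values, the standardization step is omitted, and the names `toy_substitution_table` and `encode` are hypothetical. Each residue is mapped to a 20-dimensional column, so a sequence of length $N$ becomes a $20 \times N$ matrix that is then zero-padded to $20 \times N_{\max}$.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def toy_substitution_table():
    """Placeholder 20-dimensional vectors per residue (NOT the BLOSUM50 values)."""
    return {a: [5.0 if a == b else -1.0 for b in AMINO_ACIDS] for a in AMINO_ACIDS}

def encode(sequence, table, n_max):
    """Return a 20 x n_max matrix as a list of 20 rows.

    Column j holds the vector of residue j; columns past the end of the
    sequence are filled with zeros.
    """
    if len(sequence) > n_max:
        raise ValueError("sequence is longer than n_max")
    columns = [table[aa] for aa in sequence]
    columns += [[0.0] * len(AMINO_ACIDS) for _ in range(n_max - len(sequence))]
    return [list(row) for row in zip(*columns)]
```

In practice, the placeholder table would be replaced by the actual substitution matrix, and standardization would be applied before padding so that padded positions remain exactly zero.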

TCR-peptide interaction prediction
In order to evaluate AVIB's performance on the TCR-peptide interaction prediction task, we perform experiments on three datasets: the α + β set, the β set and their union β set ∪ α + β set. For the β set and the union set, input samples are $(x_{\mathrm{Peptide}}, x_{\mathrm{CDR3}\beta})$ pairs. For the α + β set, inputs can be either $(x_{\mathrm{Peptide}}, x_{\mathrm{CDR3}\beta})$ pairs or $(x_{\mathrm{Peptide}}, x_{\mathrm{CDR3}\alpha}, x_{\mathrm{CDR3}\beta})$ triples. For all tri-sequence experiments, we adopt a full multi-sequence extension of the $J_{\mathrm{AVIB}}$ objective (see Supplementary Material S6, Equation 12).
Evaluation metrics. Table 1 summarizes the experimental results. For evaluation, the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPR) and the F1 score (F1) are computed on the test sets. Five repeated experiments with different 80/20 training/test random splits are performed for robust evaluation.
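For readers less familiar with these metrics, the following minimal sketch shows how AUROC and F1 can be computed and how a random 80/20 split can be drawn. It is an illustration only: AUROC is computed through its pairwise-ranking interpretation, AUPR is omitted, and the function names are hypothetical rather than taken from the released code.

```python
import random

def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_score(labels, predictions):
    """Harmonic mean of precision and recall for binary predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def random_split(n_samples, test_fraction, rng):
    """Return (train_indices, test_indices) for one random split."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]
```

Repeating `random_split` with different seeds and averaging the metrics over the resulting test sets corresponds to the repeated-split protocol described above.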
Peptide+CDR3β. On the β set, AVIB obtains ~4% higher AUROC and 8% higher AUPR compared to the best baseline, ERGO II. On the β set ∪ α + β set, AVIB outperforms ERGO II by achieving ~3% higher AUROC and ~4% higher AUPR. On the α + β set, in the peptide+CDR3β setting, AVIB performs comparably to ERGO II.
Peptide+CDR3α. Peptide+CDR3α results on the α + β set are reported in Supplementary Material S7.
These experimental results demonstrate that AVIB is a competitive method for TCR-peptide interaction prediction. On the α + β set, AVIB's tri-sequence (peptide+CDR3α+CDR3β) results outperform the results obtained in both bi-sequence (peptide+CDR3α and peptide+CDR3β) settings (see Table 1 and Supplementary Material S7). This shows that AVIB is an effective multi-sequence learning method, which can learn richer representations from the joint analysis of multiple data sequences.

Cross-dataset experiments
In Supplementary Material S8, we present cross-dataset experiments, in which we train AVIB and the baseline models on the α + β set and test on the β set. As shown in Supplementary Figure S7, the α + β set and the β set present similar peptide distributions, but contain different CDR3β sequences. Our cross-dataset results show that all models fail to generalize to unseen CDR3β sequences. These results are in line with Grazioli et al. (2022b), which analogously shows that state-of-the-art models fail to generalize to unseen peptides.

Visualization of the attention weights
One of the advantages of using AoE for estimating the multi-sequence posterior is the dynamic weighting of the multiple single-sequence posteriors. This makes it possible to capture relationships between the input sequences. In Supplementary Material S9, we show how the attention weights derived from the $\mu_{\mathrm{AoE}}$ self-attention block change while gradually mutating the peptide sequence. We notice that, as the peptide sequence disruption increases, the peptide-CDR3β attention weight drops while the CDR3β-peptide attention weight increases.

Multi-sequence posterior approximation
In this section, we compare various techniques to approximate Gaussian joint posteriors. We perform experiments and benchmark on two datasets: the α + β set and the NetMHCIIpan-4.0 set. Experiments on the α + β set employ either $(x_{\mathrm{Peptide}}, x_{\mathrm{CDR3}\beta})$ pairs or $(x_{\mathrm{Peptide}}, x_{\mathrm{CDR3}\alpha}, x_{\mathrm{CDR3}\beta})$ triples as inputs. Experiments on the NetMHCIIpan-4.0 set take $(x_{\mathrm{Peptide}}, x_{\mathrm{MHC}})$ pairs as inputs.
The ground truth labels of the NetMHCIIpan-4.0 set are continuous BA scores. For BA regression, we train models by substituting the log-likelihood of Equation 6 with a mean squared error (MSE) loss. BA prediction of pMHC complexes is, just like TCR-peptide interaction prediction, a fundamental problem in computational immuno-oncology (Cheng et al., 2021; O'Donnell et al., 2018, 2020; Reynisson et al., 2020) and is a key step in the development of vaccines against cancer (Buhrman and Slansky, 2013; Corse et al., 2011; Hundal et al., 2020; McMahan et al., 2006; Meng and Butterfield, 2002; Slansky et al., 2000) and infectious diseases (Malone et al., 2020). Peptides can only be presented on the surface of cells if they bind to MHC molecules. This mechanism allows the immune system to gain knowledge about in-cell anomalies such as cancerous mutations or viral infections.
Baseline and ablation methods. We benchmark AVIB, which employs AoE, against MVIB (Grazioli et al., 2022a), which employs PoE. Additionally, we perform an ablation study meant to investigate the influence of multi-head self-attention in AoE. For the ablation, we remove the multi-head self-attention module from AoE (see Equation 5) and only apply a simple pooling of the various single-sequence posteriors. We define two ablation methods: Max Pooling of Experts (MaxPOOLoE), which adopts a 1D max pooling function, and Average Pooling of Experts (AvgPOOLoE), which adopts 1D average pooling.
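The two ablation variants reduce to element-wise pooling over the stacked single-sequence parameters, as in the short sketch below. It assumes that the pooling runs over the sequence axis of the stacked $\mu$ (or $\sigma$) rows; the function names are hypothetical.

```python
def max_pool_of_experts(rows):
    """Element-wise maximum over the stacked single-sequence parameter rows."""
    return [max(col) for col in zip(*rows)]

def avg_pool_of_experts(rows):
    """Element-wise average over the stacked single-sequence parameter rows."""
    return [sum(col) / len(col) for col in zip(*rows)]
```

Unlike AoE, neither variant has trainable parameters, so the relative weighting of the sequences cannot adapt to the input.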
Evaluation metrics. For the evaluation of classification results on the α + β set, we adopt AUROC, AUPR, F1 and accuracy. For evaluating regression on the NetMHCIIpan-4.0 set, we employ MSE, root mean squared error (RMSE) and the $R^2$ coefficient (Wright, 1921). Table 2 presents classification and regression results on the α + β set and the NetMHCIIpan-4.0 set. AoE achieves the best results in all settings and on both datasets. Interestingly, the ablation methods AvgPOOLoE and MaxPOOLoE (Supplementary Material S10) achieve worse performance compared to PoE.
Table caption (partial): baselines are NetTCR-2.0 (Montemurro et al., 2021), ERGO II (Springer et al., 2021) and LUPI-SVM (Abbasi et al., 2018). Best results are in bold.
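The regression metrics named above can be computed as in the following sketch, which assumes the usual definition of $R^2$ as one minus the ratio of residual to total sum of squares; the function name is hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R^2) for paired lists of targets and predictions."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return mse, math.sqrt(mse), 1.0 - ss_res / ss_tot
```

Lower MSE and RMSE and higher $R^2$ indicate better agreement between predicted and measured binding affinities.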

Missing input sequences
In this section, we study AVIB's performance when certain data sequences are available at training time, but missing at test time. We train AVIB on $(x_{\mathrm{Peptide}}, x_{\mathrm{CDR3}\alpha}, x_{\mathrm{CDR3}\beta})$ triples from the α + β set. At test time, we omit one of the two CDR3 sequences. In real-world settings, it is in fact common to have batches of data where only CDR3α or CDR3β information is available. It is therefore efficient to leverage one single model which can also operate if a CDR3 sequence is missing. This removes the need to train different models on the various sequence subsets. Figure 2 presents the experimental results. As expected, AVIB performance decreases when a CDR3 sequence is missing at test time. However, the performance achieved by AVIB when trained in the tri-sequence setting and tested on missing sequences is not consistently different from the performance deriving from a bi-sequence training. We only observe a significant difference in the AUPR score when the CDR3α sequence is missing: AVIB trained on peptide+CDR3β achieves ~3% higher AUPR than AVIB trained on peptide+CDR3α+CDR3β and tested with missing CDR3α.

OOD detection
Alemi et al. (2018) show that VIB has the ability to detect OOD samples. In this section, the OOD detection capabilities of AVIB are investigated. We assume that we have an in-distribution (ID) dataset $D_{\mathrm{ID}}$ and an OOD dataset $D_{\mathrm{OOD}}$. We leverage the expectation of the learned latent posterior conditioned on all input sequences and fit two class-conditional Gaussian distributions using the ID training samples, one for the binding samples and one for the non-binding ones (Equation 7). The class-conditional Gaussian distributions share the same covariance matrix (Equation 8). Analogously to Lee et al. (2018), we discriminate whether test samples are ID or OOD using the Mahalanobis distance score (AVIB-Maha) (Equation 9).
Training and test sets for OOD detection. Given a pair $(D_{\mathrm{ID}}, D_{\mathrm{OOD}})$, we perform a random 80/20 training/test split of $D_{\mathrm{ID}}$ into $D_{\mathrm{ID}}^{\mathrm{train}}$ and $D_{\mathrm{ID}}^{\mathrm{test}}$. We train AVIB on $D_{\mathrm{ID}}^{\mathrm{train}}$ for TCR-peptide interaction prediction. No OOD samples are available at training time. We ensure that the number of ID and OOD samples in the test set is balanced by applying the procedure described in Supplementary Material S11.1. Experiments are repeated five times with different random training/test splits.
Evaluation metrics. As evaluation metrics, in addition to AUROC and AUPR, we adopt the false positive rate at 95% true positive rate (FPR @ 95% TPR) and the detection error (see Supplementary Material S11.3). Table 3 summarizes the OOD detection results for AVIB trained on the Human TCR set for TCR-peptide interaction prediction and using the Non-human TCR set and the Human MHC set as OOD datasets. Figure 3 shows the ROC and PR curves. AVIB-Maha achieves the best results on all investigated metrics on both OOD datasets. On the Non-human TCR set, AVIB-Maha outperforms AVIB-R by ~9% AUROC and >15% AUPR. On the Human MHC set, AVIB-Maha outperforms AVIB-R by ~29% FPR at 95% TPR and ~15% detection error.
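One common way to compute the FPR at a fixed TPR is sketched below. The conventions are assumptions, since the exact definitions are given in the supplementary material: ID samples are treated as the positive class, a higher score means more ID-like (for a distance-based score, its negative can be used), and the threshold is the lowest one that still accepts the requested fraction of ID samples. The detection error is not covered, and the function name is hypothetical.

```python
import math

def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted when `tpr` of the ID samples are accepted."""
    ranked = sorted(id_scores, reverse=True)
    n_accept = math.ceil(tpr * len(ranked))  # ID samples that must pass
    threshold = ranked[n_accept - 1]
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)
```

A value close to zero means that almost no OOD sample is mistaken for an ID sample at the chosen operating point.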

Conclusion
In this article, we propose AVIB, a multi-sequence generalization of the Variational Information Bottleneck (Alemi et al., 2016), which uses AoE to implicitly approximate the posterior distribution over latent encodings conditioned on multiple input sequences. We apply AVIB to the TCR-peptide interaction prediction problem, a fundamental challenge in immuno-oncology. We show that our method significantly improves on the state-of-the-art baselines ERGO II (Springer et al., 2021) and NetTCR-2.0 (Montemurro et al., 2021). We demonstrate the effectiveness of AoE with a benchmark against PoE, as well as with an ablation study. We also show that AoE achieves the best results on peptide-MHC binding affinity regression. Furthermore, we demonstrate that AVIB can handle missing data sequences at test time. We then leverage the bottleneck posterior distribution learned by AVIB and demonstrate that it can be used to effectively detect OOD amino acid sequences. Our method significantly outperforms the baselines MSP (Hendrycks and Gimpel, 2016), ODIN (Liang et al., 2017) and AVIB-R (Alemi et al., 2018). Interestingly, we observe that generalization to unseen sequences remains a challenging problem for all investigated models. These results are analogous to those of Grazioli et al. (2022b). We believe this drop in performance is due to the sparsity of the observed training sequences. Future work should focus on tackling the problem of generalization by, for example, simulating or approximating the chemical interactions of TCR and peptides (or pMHCs), as well as their 3D structures.
Financial Support: none declared.
Conflict of Interest: none declared.